sessionInfo()Biostat 203B Homework 3
Due Feb 23 @ 11:59PM
Display machine information for reproducibility:
Load necessary libraries (you can add more as needed).
library(arrow)
Attaching package: 'arrow'
The following object is masked from 'package:utils':
timestamp
library(memuse)
library(pryr)
library(R.utils)Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.25.0 (2022-06-12 02:20:02 UTC) successfully loaded. See ?R.oo for help.
Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':
throw
The following objects are masked from 'package:methods':
getClasses, getMethods
The following objects are masked from 'package:base':
attach, detach, load, save
R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.
Attaching package: 'R.utils'
The following object is masked from 'package:arrow':
timestamp
The following object is masked from 'package:utils':
timestamp
The following objects are masked from 'package:base':
cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose() masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract() masks R.utils::extract()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ purrr::partial() masks pryr::partial()
✖ dplyr::where() masks pryr::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Display your machine memory.
memuse::Sys.meminfo()In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.
Q1. Visualizing patient trajectory
Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.
Q1.1 ADT history
A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.
Do a similar visualization for the patient with
subject_id 10013310 using ggplot.
Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.
ln -s /Users/zihengzhang/Downloads/203B/203b-hw/hw2/labevents_parquet \
/Users/zihengzhang/Downloads/203B/203b-hw/hw3/labevents_pqlabevents_data <- arrow::open_dataset("./labevents_pq",
format = "parquet")patient_id <- 10013310
race_info <- read_csv("~/mimic/hosp/admissions.csv.gz") |>
filter(subject_id == patient_id) |>
select(subject_id, race)Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
personal_info <- read_csv("~/mimic/hosp/patients.csv.gz") |>
filter(subject_id == patient_id) |>
select(subject_id, gender, anchor_age) |>
rename(age = anchor_age) |>
left_join(race_info, by = "subject_id") |>
distinct()Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): gender, anchor_year_group
dbl (3): subject_id, anchor_age, anchor_year
date (1): dod
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
diagonose_code_info <- read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz")Rows: 109775 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
diagnose_info <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz") |>
filter(subject_id == patient_id) |>
select(-icd_version) |>
filter(seq_num <= 3) |>
left_join(diagonose_code_info, by = "icd_code") |>
select(-icd_version)Rows: 4756326 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ADT_info <- read_csv("~/mimic/hosp/transfers.csv.gz") |>
filter(subject_id == patient_id) |>
select(subject_id, hadm_id, intime, outtime, eventtype, careunit) |>
filter(eventtype != "discharge") |>
arrange(intime) |>
distinct()Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): eventtype, careunit
dbl (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#mutate(ADT_Type = ifelse(ADT_Type %in% c("Coronary Care Unit (CCU)", "Neuro Surgical Intensive Care Unit (Neuro SICU)"), "ICU/CCU", ADT_Type)) |>
#mutate(ADT_Type = ifelse(ADT_Type %in% c("Neuro Intermediate"), "Neuro", ADT_Type))
lab_info <- labevents_data |>
filter(subject_id == patient_id) |>
select(subject_id, hadm_id, charttime) |>
collect() |>
distinct()
procedure_code_info <- read_csv("~/mimic/hosp/d_icd_procedures.csv.gz")Rows: 85257 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
procedure_info <- read_csv("~/mimic/hosp/procedures_icd.csv.gz") |>
filter(subject_id == patient_id) |>
select(subject_id, hadm_id, chartdate, icd_code) |>
mutate(chartdate = as.POSIXct(chartdate, format = "%Y-%m-%d %H:%M:%S")) |>
left_join(procedure_code_info, by = "icd_code") |>
select(-icd_version) |>
arrange(chartdate) |>
mutate(short_title = str_split(long_title, ",") %>% sapply(function(x)
x[1])) |>
distinct()Rows: 669186 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version
date (1): chartdate
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
title_text <- sprintf("Patient %d, %s, %d years old, %s",
personal_info$subject_id, personal_info$gender,
personal_info$age, tolower(personal_info$race))
top3_text <- diagnose_info |>
slice(1:3) |>
select(long_title) |>
mutate(first_five_words = str_split(long_title, "\\s+") %>%
sapply(function(x) {
if (length(x) < 5) {
paste(x, collapse = " ")
} else {
paste(x[1:5], collapse = " ")
}
})) |>
pull(first_five_words)
subtitle_text <- sprintf("%s\n%s\n%s", top3_text[1], top3_text[2], top3_text[3])
empty_plot <- ggplot() +
geom_blank() +
scale_y_discrete(limits = c("Procedure", "Lab", "ADT")) +
labs(x = "Calendar Time", title = title_text,
subtitle = tolower(subtitle_text)) +
theme_bw()
ADT_plot <- empty_plot +
geom_segment(data = ADT_info, aes(x = intime, xend = outtime, y = "ADT",
yend = "ADT", color = careunit,
),
linewidth = ifelse(str_detect(ADT_info$careunit, "CCU") |
str_detect(ADT_info$careunit, "ICU"),
8, 3)) +
#scale_color_manual(values = c("Emergency Department" = "red", "Medicine" = "springgreen", "ICU/CCU" = "violet", "Neuro" = "orange", "Medicine/Cardiology" = "skyblue")) +
labs(color = "Care Unit") +
theme(legend.position = "bottom", legend.box = "vertical",
text = element_text(size = 8))
Lab_plot <- ADT_plot +
geom_point(data = lab_info, aes(x = charttime, y = "Lab"),
size = 5, shape = "+")
Procedure_plot <- Lab_plot +
geom_point(data = procedure_info, aes(x = chartdate, y = "Procedure",
shape = short_title), size = 5) +
scale_shape_manual(values =
seq(1, length(unique(procedure_info$short_title)))) +
labs(shape = "Procedure", y = "") +
guides(shape = guide_legend(nrow = 5)) +
theme(legend.position = "bottom", legend.box = "vertical",
text = element_text(size = 8))
plot(Procedure_plot)Answer: This patient has been admitted to the hospital three times. Each admission results in the top 3 diagnoses. Therefore, for the subtitle, we select one of the admission records and displayed the top 3 diagnoses for that admission. Here we select admission record with hadm_id = 21243435 and we only display the first few words of the long title because the long title is too long.
For the label of procedure, since the long title is too long, we only display the words before the first comma of the long title. We manually set the shape of the procedure to be the total 9 procedures. However, since the shape palette can deal with a maximum of 6 discrete values in RStudio, only the first 6 procedures can be displayed in the figure.
Q1.2 ICU stays
ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.
Do a similar visualization for the patient 10013310.
item_icu <- read_csv("~/mimic/icu/d_items.csv.gz")Rows: 4014 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): label, abbreviation, linksto, category, unitname, param_type
dbl (3): itemid, lownormalvalue, highnormalvalue
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
zcat < ~/mimic/icu/chartevents.csv.gz | \
awk -F, 'BEGIN{OFS=","} {
if ($7 == 220045 || $7 == 220180 || $7 == 220179 || $7 == 220210 ||
$7 == 223761) print}' | gzip > ./chartevents_filtered.csv.gzpatient_id <- 10013310
item_icu_info <- item_icu |>
filter(itemid %in% c(220045, 220180, 220179, 220210, 223761)) |>
select(itemid, abbreviation) |>
distinct()
chartevents_info <- read_csv("./chartevents_filtered.csv.gz", col_names =
c('subject_id', 'hadm_id',
'stay_id', 'caregiver_id',
'charttime', 'storetime',
'itemid', 'value',
'valuenum', 'valueuom', 'warning')) |>
filter(subject_id == patient_id) |>
select(subject_id, stay_id, itemid, charttime, value) |>
left_join(item_icu_info, by = "itemid")Rows: 22504119 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): valueuom
dbl (8): subject_id, hadm_id, stay_id, caregiver_id, itemid, value, valuenu...
dttm (2): charttime, storetime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
title_text <- sprintf("Patient %d ICU stays - Vitals",
chartevents_info$subject_id)
icu_plot <- ggplot(chartevents_info, aes(x = charttime, y = value, color = abbreviation)) +
geom_point(size = 1) +
geom_line() +
labs(title = title_text,
x = "",
y = "") +
facet_grid(rows = vars(abbreviation), cols = vars(stay_id), scales = "free") +
theme_light() +
theme(legend.position = "none") +
scale_x_datetime(date_labels = "%b %d %H:%M") +
guides(x = guide_axis(n.dodge = 2))
plot(icu_plot)Q2. ICU stays
icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are
zcat < ~/mimic/icu/icustays.csv.gz | headsubject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los
10000032,29079034,39553978,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2180-07-23 14:00:00,2180-07-23 23:50:47,0.4102662037037037
10000980,26913865,39765666,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2189-06-27 08:42:00,2189-06-27 20:38:27,0.4975347222222222
10001217,24597018,37067082,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-11-20 19:18:02,2157-11-21 22:08:00,1.1180324074074075
10001217,27703517,34592300,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-12-19 15:42:24,2157-12-20 14:27:41,0.9481134259259258
10001725,25563031,31205490,Medical/Surgical Intensive Care Unit (MICU/SICU),Medical/Surgical Intensive Care Unit (MICU/SICU),2110-04-11 15:52:22,2110-04-12 23:59:56,1.338587962962963
10001884,26184834,37510196,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-01-11 04:20:05,2131-01-20 08:27:30,9.171817129629629
10002013,23581541,39060235,Cardiac Vascular Intensive Care Unit (CVICU),Cardiac Vascular Intensive Care Unit (CVICU),2160-05-18 10:00:53,2160-05-19 17:33:33,1.3143518518518518
10002155,20345487,32358465,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-03-09 21:33:00,2131-03-10 18:09:21,0.8585763888888889
10002155,23822395,33685454,Coronary Care Unit (CCU),Coronary Care Unit (CCU),2129-08-04 12:45:00,2129-08-10 17:02:38,6.178912037037037
Q2.1 Ingestion
Import icustays.csv.gz as a tibble icustays_tble.
icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz")Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): first_careunit, last_careunit
dbl (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Q2.2 Summary and visualization
How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.
icustays_tble |> distinct(subject_id) |> count()# A tibble: 1 × 1
n
<int>
1 50920
icustays_tble |> count(subject_id) # A tibble: 50,920 × 2
subject_id n
<dbl> <int>
1 10000032 1
2 10000980 1
3 10001217 2
4 10001725 1
5 10001884 1
6 10002013 1
7 10002155 3
8 10002348 1
9 10002428 4
10 10002430 1
# ℹ 50,910 more rows
Answer: There are 50920 unique subject_id. A subject_id can have multiple ICU stays.
icustays_tble |> count(subject_id) |> ggplot(aes(n)) + geom_bar() +
labs(title = "Number of ICU stays per subject_id") + xlab("ICU Stays") +
ylab("Number of subject_id") + theme_bw()Answer: The number of ICU stays per subject_id is shown in the graph above. Most subject_id have 1 ICU stay, and the number of subject_id decreases as the number of ICU stays increases.
Q3. admissions data
Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are
zcat < ~/mimic/hosp/admissions.csv.gz | headsubject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P874LG,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P09Q6Y,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P60CC5,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P30KEH,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P51VDL,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0
10000084,23052089,2160-11-21 01:56:00,2160-11-25 14:52:00,,EW EMER.,P6957U,WALK-IN/SELF REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,MARRIED,WHITE,2160-11-20 20:36:00,2160-11-21 03:20:00,0
10000084,29888819,2160-12-28 05:11:00,2160-12-28 16:07:00,,EU OBSERVATION,P63AD6,PHYSICIAN REFERRAL,,Medicare,ENGLISH,MARRIED,WHITE,2160-12-27 18:32:00,2160-12-28 16:07:00,0
10000108,27250926,2163-09-27 23:17:00,2163-09-28 09:04:00,,EU OBSERVATION,P38XXV,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2163-09-27 16:18:00,2163-09-28 09:04:00,0
10000117,22927623,2181-11-15 02:05:00,2181-11-15 14:52:00,,EU OBSERVATION,P2358X,EMERGENCY ROOM,,Other,ENGLISH,DIVORCED,WHITE,2181-11-14 21:51:00,2181-11-15 09:57:00,0
Q3.1 Ingestion
Import admissions.csv.gz as a tibble admissions_tble.
admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz")Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Q3.2 Summary and visualization
Summarize the following information by graphics and explain any patterns you see.
- number of admissions per patient
- admission hour (anything unusual?)
- admission minute (anything unusual?)
- length of hospital stay (from admission to discharge) (anything unusual?)
According to the MIMIC-IV documentation,
All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.
admissions_tble |> distinct(subject_id) |> count()# A tibble: 1 × 1
n
<int>
1 180733
admissions_tble |> count(subject_id)# A tibble: 180,733 × 2
subject_id n
<dbl> <int>
1 10000032 4
2 10000068 1
3 10000084 2
4 10000108 1
5 10000117 2
6 10000248 1
7 10000280 1
8 10000560 1
9 10000635 1
10 10000719 1
# ℹ 180,723 more rows
Answer: There are 180733 unique subject_id. A subject_id can have multiple admissions.
admissions_tble |> count(subject_id) |> ggplot(aes(n)) + geom_bar() +
labs(title = "Number of admissions per subject_id") + xlab("Admissions") +
ylab("Number of subject_id") + theme_bw()Answer: The number of admissions per subject_id is shown in the graph above. Most subject_id have 1 admission, and the number of subject_id decreases as the number of admissions increases.
admissions_hour <- hour(admissions_tble$admittime) |> as_tibble()
ggplot(data = admissions_hour, aes(value)) + geom_bar() +
labs(title = "Admission Hour") + xlab("Hour") + ylab("Count") + theme_bw()Answer: The admission hour is shown in the graph above. It seems there is something unusual. The number of admissions is highest at midnight, gradually decreasing over time. However, there is a sudden increase in admissions at 7AM. Then, as time progresses, the number of admissions gradually increases. The evening and night are the peak admission times within a day, with the overall situation being relatively balanced.
admissions_minute <- minute(admissions_tble$admittime) |> as_tibble()
ggplot(data = admissions_minute, aes(value)) + geom_bar() +
labs(title = "Admission Minute") + xlab("Minute") + ylab("Count") + theme_bw()Answer: The admission minute is shown in the graph above. There are very clear unusual patterns. There are significant peaks in admissions at the top of the hour, 15 minutes past, 30 minutes past, and 45 minutes past the hour. At other times, the number of admissions is relatively evenly distributed.
admissions_length <- interval(admissions_tble$dischtime,
admissions_tble$admittime) |> hour() |> as_tibble()Warning in class(xx) <- cl: Setting class(x) to multiple strings ("POSIXct",
"POSIXt", ...); result will no longer be an S4 object
Warning: tz(): Don't know how to compute timezone for object of class Interval;
returning "UTC".
admissions_length |> ggplot(aes(value)) + geom_bar() +
labs(title = "Length of Admission ") + xlab("Length") + ylab("Count") + theme_bw()Answer: The length of hospital stay is shown in the graph above. It appears there is some unusual pattern. The most common length of hospital stay is around 3 hours. As the length of hospital stay increases, the frequency gradually decreases. The least common length is around 14 hours. However, as the length of hospital stay continues to increase, the frequency begins to gradually rise again. The longest length is around 24 hours.
Q4. patients data
Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are
zcat < ~/mimic/hosp/patients.csv.gz | headsubject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
10000032,F,52,2180,2014 - 2016,2180-09-09
10000048,F,23,2126,2008 - 2010,
10000068,F,19,2160,2008 - 2010,
10000084,M,72,2160,2017 - 2019,2161-02-13
10000102,F,27,2136,2008 - 2010,
10000108,M,25,2163,2014 - 2016,
10000115,M,24,2154,2017 - 2019,
10000117,F,48,2174,2008 - 2010,
10000178,F,59,2157,2017 - 2019,
Q4.1 Ingestion
Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.
patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz")Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): gender, anchor_year_group
dbl (3): subject_id, anchor_age, anchor_year
date (1): dod
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Q4.2 Summary and visualization
Summarize variables gender and anchor_age by graphics, and explain any patterns you see.
ggplot(patients_tble, aes(x = gender, fill = gender)) +
geom_bar() +
geom_text(stat='count', aes(label=after_stat(count)),
vjust=-0.5, color='black', size=3) +
labs(title = "Distribution of Gender",
x = "Gender",
y = "Count") +
theme_bw()Answer: The number of female patients is slightly higher than the number of male patients. 158,553 vs 141,159.
ggplot(patients_tble, aes(x = anchor_age)) +
geom_histogram(binwidth = 5, fill = "red", color = "white", alpha = 0.7) +
labs(title = "Distribution of Anchor_Age",
x = "Anchor_Age",
y = "Count") +
theme_bw()ggplot(patients_tble, aes(x = gender, y = anchor_age, fill = gender)) +
geom_violin(color = "white") +
labs(title = "Anchor Age Distribution by Gender",
x = "Gender",
y = "Anchor Age") +
scale_fill_manual(values = c("purple", "orange"))Answer: The distribution of anchor_age is shown in the graph above. The number of patients is highest in the youngest age group, around 20 years old, and the number of patients firstly decreases and then increases as the age increases. There is another peak in the age group 55. And then the number of patients gradually decreases as the age increases. Next, we compare anchor_age in different gender groups. The distribution of anchor_age in different gender groups are almost the same. The largest number of age groups for both gender groups are around 20 years old and then the number of people decreases when they are 40 years old. There is a slight increase in numbers around 60. Finally, as age increases, the number of people decreases for both gender groups.
Q5. Lab results
labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are
zcat < ~/mimic/hosp/labevents.csv.gz | headlabevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,
d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.
zcat < ~/mimic/hosp/d_labitems.csv.gz | headitemid,label,fluid,category
50801,Alveolar-arterial Gradient,Blood,Blood Gas
50802,Base Excess,Blood,Blood Gas
50803,"Calculated Bicarbonate, Whole Blood",Blood,Blood Gas
50804,Calculated Total CO2,Blood,Blood Gas
50805,Carboxyhemoglobin,Blood,Blood Gas
50806,"Chloride, Whole Blood",Blood,Blood Gas
50808,Free Calcium,Blood,Blood Gas
50809,Glucose,Blood,Blood Gas
50810,"Hematocrit, Calculated",Blood,Blood Gas
We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.
Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.
filtered_labevents_parquet <- arrow::open_dataset("./labevents_pq",
format = "parquet") |>
dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902,
50882, 51221, 51301, 50931)) |>
dplyr::filter(subject_id %in% unique(icustays_tble$subject_id)) |>
dplyr::filter(hadm_id %in% unique(icustays_tble$hadm_id)) |>
dplyr::collect()d_labitems <- read_csv("~/mimic/hosp/d_labitems.csv.gz")Rows: 1622 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): label, fluid, category
dbl (1): itemid
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_item_subset <- d_labitems %>%
select(itemid, label)
icustays_tble_subset <- select(icustays_tble, subject_id, stay_id, intime)
labevents_tble <- arrow::open_dataset("./labevents_pq",
format = "parquet") |>
dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902,
50882, 51221, 51301, 50931)) |>
dplyr::select(subject_id, itemid, valuenum, storetime) |>
arrange(subject_id, itemid) |>
dplyr::collect() |>
semi_join(icustays_tble_subset, by = c("subject_id")) |>
left_join(d_item_subset, by = c("itemid")) |>
left_join(icustays_tble_subset, by = c("subject_id")) |>
filter(storetime < intime) |>
group_by(subject_id, label, stay_id) |>
summarise(finalvalue = last(valuenum)) |>
spread(label, finalvalue)Warning in left_join(left_join(semi_join(dplyr::collect(arrange(dplyr::select(dplyr::filter(arrow::open_dataset("./labevents_pq", : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 845 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
`summarise()` has grouped output by 'subject_id', 'label'. You can override
using the `.groups` argument.
labevents_tble# A tibble: 68,467 × 10
# Groups: subject_id [47,305]
subject_id stay_id Bicarbonate Chloride Creatinine Glucose Hematocrit
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10000032 39553978 25 95 0.7 102 41.1
2 10000980 39765666 21 109 2.3 89 27.3
3 10001217 34592300 30 104 0.5 87 37.4
4 10001217 37067082 22 108 0.6 112 38.1
5 10001725 31205490 NA 98 NA NA NA
6 10001884 37510196 30 88 1.1 141 39.7
7 10002013 39060235 24 102 0.9 288 34.9
8 10002155 31090461 23 98 2.8 117 25.5
9 10002155 32358465 26 85 1.4 133 22.4
10 10002155 33685454 24 105 1.1 138 39.7
# ℹ 68,457 more rows
# ℹ 3 more variables: Potassium <dbl>, Sodium <dbl>, `White Blood Cells` <dbl>
Q6. Vitals from charted events
chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are
zcat < ~/mimic/icu/chartevents.csv.gz | headd_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.
zcat < ~/mimic/icu/d_items.csv.gz | headWe are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.
Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.
Q7. Putting things together
Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables
- all variables in
icustays_tble
- all variables in
admissions_tble
- all variables in
patients_tble - the last lab measurements before the ICU stay in
labevents_tble - the first vital measurements during the ICU stay in
chartevents_tble
The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.
Q8. Exploratory data analysis (EDA)
Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:
Length of ICU stay
losvs demographic variables (race, insurance, marital_status, gender, age at intime)Length of ICU stay
losvs the last available lab measurements before ICU stayLength of ICU stay
losvs the average vital measurements within the first hour of ICU stayLength of ICU stay
losvs first ICU unit